2025-01-08

Welcome

R on High-Performance Computing (HPC)

Audience: Clinical researchers, biologists, and chemists new to HPC scripting.

Goal: Show how R can be used efficiently on HPC systems like the St. Jude HPCF.

🍳 HPC as a Metaphor

Think of HPC as a gourmet kitchen with many chefs.

  • Laptop: one chef with one pan.
  • HPC: 32 chefs, 32 pans, 1 recipe split up = faster meal!

Key Concept: Split your work so many nodes can help.
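A minimal sketch of that idea in R, using the base parallel package (this assumes a Unix-like system, where mclapply can fork workers; on a cluster you would scale the same pattern across many cores):

```r
library(parallel)

# One "chef": process 8 tasks one at a time
serial <- lapply(1:8, function(x) x^2)

# Many "chefs": the same tasks split across available cores
n_cores <- max(1, detectCores() - 1)
parallel_res <- mclapply(1:8, function(x) x^2, mc.cores = n_cores)

# Same result either way -- the work was just shared among workers
identical(serial, parallel_res)
```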

CPU (Central Processing Unit) vs CPU Core

A CPU, or Central Processing Unit, is the main “brain” of a computer. A CPU core is a single processing unit within that CPU.

Think of the CPU as a house and the cores as rooms within that house. Each core can independently execute instructions, allowing the CPU to handle multiple tasks simultaneously.

Modern CPUs often have multiple cores (dual-core, quad-core, etc.), enabling greater multitasking capabilities.
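You can ask R how many cores it sees on the current machine (a quick sanity check; note that the number reported on a login node is not what your batch job will actually be allocated):

```r
library(parallel)

detectCores()                  # logical cores (may count hyperthreads)
detectCores(logical = FALSE)   # physical cores, where the OS reports them
```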

CPU (Kitchen) vs CPU Cores (Chefs)

Think of HPC Like a Restaurant Kitchen I

Think of HPC Like a Restaurant Kitchen II

📡 Job Monitoring (LSF States)

  • The LSF (Load Sharing Facility) scheduler is like the restaurant maître d’ who keeps track of who’s seated, who’s still waiting, and who’s finished their meal.
  • Common job states reported by bjobs: PEND (waiting in a queue), RUN (running), DONE (finished successfully), EXIT (failed or killed), and SUSP (suspended).

Logging into St. Jude HPC

ssh your_username@hpc.stjude.org
  • Use Terminal on macOS or MobaXterm on Windows.

🔐 Login Nodes and 🧠 Compute Nodes

  • Use the login node only to prepare and submit jobs — not to run heavy tasks!

📋 HPC Architecture

  • Login node: Entry point
  • Compute nodes: Where your R code runs
  • Scheduler (LSF, Load Sharing Facility): Manages job distribution
  • In LSF, bsub is the command used to submit jobs, typically specifying resource requirements and other parameters (see the IBM LSF documentation).
bsub < job.sh     # submit a job
bjobs             # check job status
bqueues           # view queue info
bkill <jobID>     # kill a job
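A typical submission bundles the resource flags into a single command line (the script name and job name here are illustrative):

```shell
# 4 cores, 1 hour walltime, ~4 GB memory, named job, logs written per job ID
bsub -n 4 -W 01:00 -R "rusage[mem=4096]" \
     -J my_r_job -o my_r_job.%J.out -e my_r_job.%J.err \
     "module load R/4.4.0; Rscript analysis.R"
```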

bqueues Output Layout

📊 HPC Queues vs Login/Compute Nodes I

📊 HPC Queues vs Login/Compute Nodes II

  • Queues define the rules and policies for job scheduling, like job duration, priority, resource type (e.g., GPU), or department usage.
  • Queues are not job types; instead, they’re more like “lines with rules”—some lines move faster (priority), some serve special equipment (GPU), and some are restricted (lab-specific).
  • Nodes are the actual hardware — CPUs, RAM, GPUs — where the work gets done.
  • Jobs go from login node → queue → compute node.

Example:

  • interactive queue: Allows interactive sessions like RStudio
  • gpu_priority: For GPU-enabled jobs, with higher urgency
  • compbio: A queue shared by Computational Biology groups
  • priority: A queue with expedited scheduling
  • large_mem: Reserved for high-RAM needs (e.g., 500+ GB)
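Choosing a queue is just the -q flag at submission time (queue names follow the St. Jude examples above; the script name is illustrative):

```shell
# Interactive session, e.g. to use R at the prompt
bsub -q interactive -Is bash

# Batch job on the high-memory queue (~512 GB requested)
bsub -q large_mem -R "rusage[mem=512000]" "Rscript big_merge.R"
```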

🧭 Walkthrough: 3 Real-World Examples

  Example             Description                       Mode                  Why it Matters
  ✅ Small R Script   Read & summarize cancer dataset   Interactive & Batch   Illustrates two job submission styles
  🚀 Bootstrap        Cox models on 10,000 bootstraps   Batch + Parallel      Shows clear speed-up using HPC
  🧠 GPU Matrix Job   Matrix inversion on GPU           GPU + Batch           Showcases GPU for numeric computing

🧪 Example 1: Small R Job – Interactive Mode

$ bsub -P survival_analysis -q interactive -Is bash
$ module avail R
$ module load R/4.4.0
$ R
# Inside R session
library(dplyr)
library(survival)
data(cancer, package = "survival")
head(cancer)
##   inst time status age sex ph.ecog ph.karno pat.karno meal.cal wt.loss
## 1    3  306      2  74   1       1       90       100     1175      NA
## 2    3  455      2  68   1       0       90        90     1225      15
## 3    3 1010      1  56   1       0       90        90       NA      15
## 4    5  210      2  57   1       1       90        60     1150      11
## 5    1  883      2  60   1       0      100        90       NA       0
## 6   12 1022      1  74   1       1       50        80      513       0
# inst:      Institution code
# time:      Survival time in days
# status:    Censoring status: 1 = censored, 2 = dead
# age:       Age in years
# sex:       Male = 1, Female = 2
# ph.ecog:   ECOG performance score as rated by the physician: 0 = asymptomatic, 1 = symptomatic but completely ambulatory, 2 = in bed <50% of the day, 3 = in bed >50% of the day but not bedbound, 4 = bedbound
# ph.karno:  Karnofsky performance score (0 = bad to 100 = good) rated by physician
# pat.karno: Karnofsky performance score as rated by patient
# meal.cal:  Calories consumed at meals
# wt.loss:   Weight loss in last six months (pounds)

# Mean survival time (days) by sex -- a mean time, despite the column name
results <- cancer %>% group_by(sex) %>%
  summarise(survival_rate = mean(time))
print(results)
## # A tibble: 2 × 2
##     sex survival_rate
##   <dbl>         <dbl>
## 1     1          283.
## 2     2          339.
# write.csv(results, "../../output/survival_analysis_interactive.csv")

🧪 Example 1: Small R Job – Batch Mode

survival_analysis.R
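The slide references survival_analysis.R without showing it. A minimal version that reproduces the interactive session might look like the sketch below; it is written with base R's aggregate() rather than dplyr so that it depends only on the bundled survival package, and the output filename is an assumption:

```r
# survival_analysis.R -- batch version of the interactive analysis
library(survival)

data(cancer, package = "survival")

# Mean survival time (days) by sex: 1 = male, 2 = female
results <- aggregate(time ~ sex, data = cancer, FUN = mean)
names(results)[2] <- "survival_rate"
print(results)

out_file <- "survival_analysis_batch.csv"   # adjust path for your project layout
write.csv(results, out_file, row.names = FALSE)
```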

job1.sh

#!/bin/bash
#BSUB -n 1
#BSUB -q priority
#BSUB -W 00:10
#BSUB -R "rusage[mem=2048]"
#BSUB -J survival_analysis
#BSUB -o logs/output.%J.log
#BSUB -e logs/error.%J.log

BASE_DIR=/research/rgs01/home/clusterHome/zqu/workshop07252025/code/

cd $BASE_DIR
module load R/4.4.0
Rscript R/survival_analysis.R

🚀 Example 2: Bootstrap with Parallelization (Local Test)

parallel_bootstrap.R

library(survival)
library(doParallel)
library(foreach)

data(cancer, package = "survival")

# ----- 1. Parallel Execution -----
start_parallel <- Sys.time()

cl <- makeCluster(8)
registerDoParallel(cl)

coefs_parallel <- foreach(i = 1:10000, .combine = rbind, .packages = c("survival")) %dopar% {
  data(cancer, package = "survival")
  samp <- cancer[sample(nrow(cancer), replace = TRUE), ]
  coef(coxph(Surv(time, status) ~ age + sex, data = samp))
}

stopCluster(cl)

end_parallel <- Sys.time()
parallel_time <- end_parallel - start_parallel
time_value <- as.numeric(parallel_time)
time_unit <- attr(parallel_time, "units")
cat(sprintf("⏱️ Time taken (parallel): %.2f %s\n", time_value, time_unit))

# ----- 2. Sequential Execution -----
start_seq <- Sys.time()

coefs_seq <- matrix(NA, nrow = 10000, ncol = 2)
for (i in 1:nrow(coefs_seq)) {
  samp <- cancer[sample(nrow(cancer), replace = TRUE), ]
  fit <- coxph(Surv(time, status) ~ age + sex, data = samp)
  coefs_seq[i, ] <- coef(fit)
}

end_seq <- Sys.time()
sequential_time <- end_seq - start_seq
time_value <- as.numeric(sequential_time)
time_unit <- attr(sequential_time, "units")
cat(sprintf("⏱️ Time taken (sequential): %.2f %s\n", time_value, time_unit))

# Optional: Save outputs if needed
write.csv(coefs_parallel, "../output/bootstrap_coefs_parallel.csv", row.names = FALSE)
write.csv(coefs_seq, "../output/bootstrap_coefs_sequential.csv", row.names = FALSE)

🚀 Example 2: Bootstrap with Parallelization (On the HPC)

bootstrap_job.sh

#!/bin/bash
#BSUB -J r_parallel_bootstrap
#BSUB -q priority
#BSUB -n 10
#BSUB -R "span[hosts=1]"        # All cores on one node
#BSUB -W 00:30                  # Runtime limit hh:mm
#BSUB -R "rusage[mem=100000]"   # ~100 GB (LSF memory units are MB by default)
#BSUB -o logs/output.%J.log
#BSUB -e logs/error.%J.log

BASE_DIR=/research/rgs01/home/clusterHome/zqu/workshop07252025/code/

cd $BASE_DIR
module load R/4.4.0
Rscript R/bootstrap_parallel_hpc.R
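bootstrap_parallel_hpc.R itself is not shown on the slide. A plausible version reads the allocated core count from LSF's LSB_DJOB_NUMPROC environment variable instead of hard-coding makeCluster(8); the sketch below uses the base parallel package to stay self-contained (the doParallel/foreach version above works the same way), and the env-var fallback and output filename are assumptions:

```r
# bootstrap_parallel_hpc.R -- cluster size taken from the LSF allocation
library(survival)
library(parallel)

data(cancer, package = "survival")

# LSF exports the number of allocated cores; fall back to 1 off-cluster
n_cores <- as.integer(Sys.getenv("LSB_DJOB_NUMPROC", unset = "1"))

cl <- makeCluster(n_cores)
clusterEvalQ(cl, library(survival))
clusterExport(cl, "cancer")

# 10,000 bootstrap replicates of the Cox model coefficients
coefs <- t(parSapply(cl, 1:10000, function(i) {
  samp <- cancer[sample(nrow(cancer), replace = TRUE), ]
  coef(coxph(Surv(time, status) ~ age + sex, data = samp))
}))

stopCluster(cl)
write.csv(coefs, "bootstrap_coefs_hpc.csv", row.names = FALSE)
```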

🔍 CPU vs GPU: Key Differences

  Feature            CPU                           GPU
  Cores              Few (2–64)                    Hundreds to thousands
  Task Type          Sequential tasks              Parallelizable tasks
  Latency            Low (good for logic)          Higher (good for throughput)
  Memory Hierarchy   Complex, flexible             Simpler, higher bandwidth
  Best For           General-purpose processing    Large-scale matrix operations

⚙️ Parallel Computation: CPU vs GPU in R

                      CPU Parallel (doParallel)   GPU Parallel (gpuR)
  Parallelism Type    Multi-core                  Many-core (SIMD)
  Ideal Use Case      Independent iterations      Matrix ops, deep learning
  Setup Complexity    Low to Medium               Medium (GPU config required)
  Speed Improvement   Moderate                    High (if vectorized)
  Limitations         Memory bottlenecks          Data transfer & library limits
  • Without SIMD (CPU style): You make one cookie at a time — scoop dough, shape, bake, repeat.

  • With SIMD (GPU style): You have a tray with 100 molds and pour dough into all of them at once — one instruction, multiple cookies baked together.
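R's vectorized arithmetic is the same idea in miniature: one instruction applied to a whole vector at once, instead of an explicit loop over elements:

```r
x <- 1:5

# Loop style: one "cookie" at a time
out_loop <- numeric(length(x))
for (i in seq_along(x)) out_loop[i] <- x[i] * 2

# Vectorized style: one instruction, all elements at once
out_vec <- x * 2

identical(out_loop, as.numeric(out_vec))  # TRUE -- same result, less code
```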

🧪 Matrix Inversion in R (Interactive HPC Node)

$ bsub -q gpu_interactive -n 1 -Is bash
$ module load R/4.1.2
$ R

# Inside R session
library(gpuR)
library(microbenchmark)
# Matrix dimensions
n <- 1000

# GPU inversion
gpu_time <- microbenchmark({
  # 1. Generate data in CPU memory
  x <- matrix(rnorm(n^2), nrow = n)

  # 2. Transfer to GPU
  mat_gpu <- gpuMatrix(x, type = "float")

  # 3. Perform operations
  inv_gpu <- solve(mat_gpu)
}, times = 1)

cat(sprintf("🧠 GPU Time: %.2f %s\n", gpu_time$time[1]/1e9, "seconds"))
  • GPU Time: 1.51 seconds

  • CPU Time: 1.89 seconds

🧠 Example 3: GPU Matrix Inversion (Job Array)

gpu_inverse_job.R
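gpu_inverse_job.R is not shown on the slide. A plausible sketch uses LSF's LSB_JOBINDEX environment variable to vary the random seed per array task, mirroring the interactive gpuR demo above; the CPU fallback (so the script also runs where gpuR is unavailable) and the sanity check are assumptions, not part of the original script:

```r
# gpu_inverse_job.R -- one matrix inversion per job-array task
n <- 1000

# LSB_JOBINDEX is set by LSF for each task of the array (1-10 here)
task <- as.integer(Sys.getenv("LSB_JOBINDEX", unset = "1"))
set.seed(task)

x <- matrix(rnorm(n^2), nrow = n)

if (requireNamespace("gpuR", quietly = TRUE)) {
  # Transfer to GPU memory, invert there, pull the result back
  inv <- as.matrix(solve(gpuR::gpuMatrix(x, type = "float")))
} else {
  # CPU fallback (assumption) so the script still runs off-GPU
  inv <- solve(x)
}

# Sanity check: x %*% inv should be close to the identity matrix
max_err <- max(abs(x %*% inv - diag(n)))
cat(sprintf("Task %d: max |x inv - I| = %.2e\n", task, max_err))
```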

gpu_job.sh

#!/bin/bash
#BSUB -J gpu_inv_array[1-10]         # Job array: 10 parallel tasks
#BSUB -q gpu_priority                # Use the appropriate GPU queue
#BSUB -gpu "num=1"                   # Request 1 GPU per task
#BSUB -n 1                           # 1 CPU core is enough
#BSUB -R "rusage[mem=1024]"          # ~1 GB per task (units are MB by default)
#BSUB -W 0:15                        # Walltime
#BSUB -o log/gpu_job_%J_%I.out       # Stdout per array task
#BSUB -e log/gpu_job_%J_%I.err       # Stderr per array task

BASE_DIR=/research/rgs01/home/clusterHome/zqu/workshop07252025/code/

cd $BASE_DIR
# Load R module (adjust version as needed)
module load R/4.1.2

# Run R script
Rscript R/gpu_inverse_job.R

🧠 Best Practices

✅ Test on small data first
✅ Use vectorized & parallel functions
✅ Monitor jobs and scale accordingly
✅ Request only what you need
✅ Use job arrays or packages like clustermq, batchtools

📚 Resources

🙌 Thanks!

You can now run R jobs smarter and faster on HPC.
Let the cluster cook for you 🍽️!